Reproducible Research Workflows

Lessons from Writing a Dissertation with Quarto

Eyayaw Beze

AMS Institute

January 29, 2026

My old workflow

  1. Writing happens in a Word document
  2. Open dataset in Excel → summary stats → copy/paste → describe
  3. Create figure in Stata/QGIS → save to file → paste into Word → manual formatting
  4. Present at seminars/confs, submit paper, …
  5. “Reviewer #2: please change var x threshold”
  6. 😰 “How did I clean the data?” “How did I make that figure?” “How many things should I update?”

Sounds familiar? 🤔

The Core Problem

Our analyses live in fragments:

| Step | Tool | Format |
|---|---|---|
| Data cleaning | Excel/Python | .xlsx, .py |
| Statistics | Stata/R | .do, .R, .py |
| Figures | QGIS/matplotlib | .png, .jpeg, .pdf |
| Writing | Word/LaTeX | .docx, .tex |
| Presentation | PowerPoint | .pptx |

  • Manual copy-paste between tools
  • No single source of truth, version chaos

Why Reproducibility Matters

  • Pick up where you left off—even after months
  • One change updates everything
  • Fewer “it worked on my machine” moments
  • The reproducibility crisis is real
  • Journals increasingly require code/data
  • Reviewers can verify your results
  • Others can build on your work

The goal of this talk

Move from fragmented to integrated, reproducible workflows

From

data.xlsx
analysis_v3_FINAL.do
figure1_old.png
first_draft.docx
first_draft_rev.docx
final_draft_v3.docx
final_final_draft.docx
final_final_draft.pdf

To

project/
├── data/raw/
├── data/processed/
├── docs/
├── scripts/
├── output/
├── README.md
├── article.pdf  # ← generated
└── article.qmd  # ← single source

Load data, run analysis, generate figures/tables, render to PDF/Word

Core Pillars of RRW

  1. Project Organization
  2. Version Control
  3. Dependency Management
  4. Literate Programming

1. Project Organization

A consistent folder structure makes everything easier:

project/
├── data/
│   ├── raw/
│   └── processed/
├── docs/
│   └── literature/
├── scripts/
│   ├── cleaning/
│   ├── estimation/
│   └── visualization/
├── output/
│   └── figs/
└── README.md

dissertation/
├── chapter-1/    # git submodule
├── chapter-2/    # ...
├── chapter-3/    # ...
└── index.qmd     # main document


# dissertation/index.qmd
{{< include chapter-1/index.qmd >}}
\newpage
{{< include chapter-2/index.qmd >}}
\newpage
{{< include chapter-3/index.qmd >}}

A repo per project

  • Each research project lives in its own folder
  • Self-contained: data, code, outputs together
  • For cumulative dissertations: git submodules
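
The layout above can be created by hand, but a small script keeps it identical across projects. The following is a minimal sketch in Python, using only the standard library; the function name `scaffold_project` and the `PROJECT_DIRS` list are illustrative, not part of any tool mentioned in this talk.

```python
from pathlib import Path

# Folder names follow the layout shown above; adjust them to your project.
PROJECT_DIRS = [
    "data/raw",
    "data/processed",
    "docs/literature",
    "scripts/cleaning",
    "scripts/estimation",
    "scripts/visualization",
    "output/figs",
]


def scaffold_project(root):
    """Create the folder layout under `root` and return the root as a Path."""
    root = Path(root)
    for rel in PROJECT_DIRS:
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe
        (root / rel).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        # Never overwrite an existing README
        readme.write_text(f"# {root.name}\n", encoding="utf-8")
    return root
```

Calling `scaffold_project("my-project")` twice is harmless: existing folders and an existing README are left untouched.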

2. Version Control with Git

Have you experienced this?

Git tracks changes to your files over time.

Key concepts:

  • Repository: The project folder, tracked by git
  • Commit: A snapshot of the project at a point in time
  • Remote: A backup on GitHub/GitLab

draft.qmd → draft.pdf

## Introduction

- We analyze housing prices in the Netherlands from 2015 to 2024.
+ We analyze housing prices, in the Netherlands from 2015 to 2024.

Git workflow

# Initialize a repository
git init

# Stage changes
git add scripts/estimate_ols.R

# Commit with a message
git commit -m "create ols regression"

# Push to GitHub
git push origin main

Track:

  • Code/Prose (.R, .py, .qmd)
  • Documentation (README.md, .txt), configuration files …

Don't track (.gitignore):

  • Large data, binary files (.xlsx, .shp, .db)
  • Generated outputs (PDFs, figs)
  • Sensitive info, cache folders …
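
One way to turn the "don't track" list into a `.gitignore` is to generate it from a small dictionary, so every project starts from the same rules. This is a sketch only: the patterns below are assumptions derived from the categories above (`.env`, `.venv/` and `__pycache__/` are typical examples, not taken from the talk), and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed patterns, one group per category listed above; tailor them per project.
IGNORE_PATTERNS = {
    "Large data and binary files": ["*.xlsx", "*.shp", "*.db"],
    "Generated outputs": ["*.pdf", "output/figs/"],
    "Sensitive info and cache folders": [".env", ".venv/", "__pycache__/"],
}


def build_gitignore(patterns):
    """Return .gitignore text with one commented section per category."""
    sections = ["\n".join([f"# {title}", *globs]) for title, globs in patterns.items()]
    return "\n\n".join(sections) + "\n"


def write_gitignore(project_root, patterns=IGNORE_PATTERNS):
    """Write the patterns to <project_root>/.gitignore and return the file path."""
    target = Path(project_root) / ".gitignore"
    target.write_text(build_gitignore(patterns), encoding="utf-8")
    return target
```

Note that a blanket `*.pdf` rule also ignores PDFs you may want to keep, such as papers under `docs/literature/`; narrow the pattern if that matters.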

3. Dependency Management

Our analysis depends on specific package versions.

The problem:

  • Package updates can break our code
  • Collaborators have different versions
  • “It works on my machine”

# Create a new project (creates pyproject.toml)
uv init my-project

# Add dependencies (creates/updates uv.lock)
uv add pandas matplotlib statsmodels

# Run scripts (auto-creates venv, installs deps)
uv run python analysis.py

# Restore on another machine
uv sync  # reads uv.lock, recreates exact environment

uv manages everything: venv, dependencies, lockfile.

See https://docs.astral.sh/uv/
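
The lockfile remains the source of truth, but it can help to print the versions actually installed when comparing machines with a collaborator. A minimal sketch using the standard-library `importlib.metadata`; the function name `package_versions` is hypothetical and the package list simply mirrors the `uv add` example above.

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Map each package name to its installed version, or 'not installed'."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    # Prints one "name==version" line per package, similar to a requirements file
    for name, ver in package_versions(["pandas", "matplotlib", "statsmodels"]).items():
        print(f"{name}=={ver}")
```

Run it inside the project environment (for example with `uv run python`) so that it reports the environment the analysis actually uses.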

renv creates isolated, reproducible R environments.

# Initialize renv in your project
renv::init()

# Install packages as usual
install.packages("data.table")

# Snapshot your dependencies
renv::snapshot()  # Creates renv.lock

# Restore on another machine
renv::restore()

See https://rstudio.github.io/renv/

4. Literate Programming with Quarto

Combine code, results, and narrative in one document.

Benefits:

  • No copy-paste errors
  • Results update automatically
  • Single source of truth
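
The core idea, stripped of any tooling, is that numbers in the text are computed from the data rather than typed by hand. A minimal sketch in plain Python; the records are made-up values for illustration, not real price data, and `describe_period` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical records standing in for a real dataset
records = [
    {"year": 2005, "price": 200_000},
    {"year": 2005, "price": 240_000},
    {"year": 2025, "price": 410_000},
]


def describe_period(rows):
    """Build the sentence from the data instead of typing numbers by hand."""
    years = [r["year"] for r in rows]
    first = min(years)
    avg_first = mean(r["price"] for r in rows if r["year"] == first)
    return (
        f"We analyze {len(rows)} transactions from {first} to {max(years)}; "
        f"the average price in {first} was {avg_first:,.0f}."
    )


print(describe_period(records))
```

If a record is added or a threshold changes, the sentence changes with it; nothing has to be remembered and updated manually.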

What is Quarto?

An open-source scientific and technical publishing system.

  • Write in Markdown (plain text)
  • Embed code: Python, R, Julia, Observable, Mermaid
  • Render to HTML, PDF, Word, slides, websites, books

[Figure: the Quarto rendering process. Graphic: rdatatoolbox]

Quarto Document Structure

---
title: "House price dynamics in the Netherlands"
author: Eyayaw Beze
date: last-modified
format: pdf
---

```{r}
#| label: setup
library(data.table)
library(ggplot2)

# Constants
start_year = 2005
end_year = 2025 # max(data$year)
```

## Introduction

We analyze housing prices in the Netherlands from `{r} start_year` to `{r} end_year`.

```{r}
#| label: import-data
data = fread("data/processed/prices.csv")
```

## Data

We use a novel dataset from CBS with `{r} nrow(unique(data))` unique transactions.
The summary statistics of the dataset are shown in @tbl-descriptive-stats.
In `{r} start_year`, the average house price was `{r} data[year==start_year, mean(price_real)]`.

```{r}
#| label: tbl-descriptive-stats
#| tbl-cap: "Descriptive statistics of real house prices"

desc_stats = data[, .(
  min = min(price_real),
  mean = mean(price_real),
  median = quantile(price_real, 0.5),
  q3 = quantile(price_real, 0.75),
  max = max(price_real),
  sd = sd(price_real)
  )]

knitr::kable(desc_stats)
```
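
Quarto also runs Python cells, so the same summary can be produced without R. The sketch below uses only the standard library and made-up prices; `describe` is a hypothetical helper. The `method="inclusive"` argument is chosen on the assumption that it corresponds to R's default quantile type; verify this before relying on exact agreement between the two.

```python
from statistics import mean, median, quantiles, stdev

# Made-up prices (in thousand euros), for illustration only; not CBS data
prices = [180, 220, 250, 300, 410]


def describe(values):
    """Return the same summary columns as the R chunk above."""
    _q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "mean": mean(values),
        "median": median(values),
        "q3": q3,
        "max": max(values),
        "sd": stdev(values),  # sample standard deviation, like R's sd()
    }


for name, value in describe(prices).items():
    print(f"{name:>6}: {value:.1f}")
```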

Demo

Brief demo of Quarto features:

Trends in property values across municipalities in the Netherlands, using Waardering Onroerende Zaken (WOZ) data

Let’s see Quarto in Action

We’ll analyze housing price trends in the Netherlands using WOZ data.

  • Load data from CSV
  • Exploratory visualizations
  • Regression with inline results
  • Cross-references and citations
  • Render to HTML and PDF

File: ./demo/woz_analysis.qmd

Modular Chapter Organization

You can organize the document by sections or chapters:

article/
├── _quarto.yml
├── 00-abstract.qmd
├── 01-intro.qmd
├── 02-literature.qmd
├── 03-method.qmd
├── 04-data.qmd
├── 05-results.qmd
├── 06-conclusion.qmd
├── 07-references.qmd
├── 08-appendix.qmd
├── index.qmd
└── references.bib

Lessons Learned

Start simple and be consistent

  • Organize files consistently
  • Use Git (even just locally as a single author)
  • Document your code, write a README
  • Separate raw data from processed data
  • Record data sources, keep a data dictionary

Common Pitfalls

  • Hardcoded file paths (C:\Users\...)
  • Not tracking package versions
  • Modifying raw data directly
  • Giant monolithic scripts
  • Forgetting to commit/push
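
Two of these pitfalls, hardcoded paths and edited raw data, can be avoided with one habit: build paths relative to the project root and only ever write to `data/processed/`. A minimal sketch under stated assumptions: the file `data/raw/prices.csv`, the `price` column, and the function `clean_prices` are hypothetical and stand in for your own cleaning step.

```python
import csv
from pathlib import Path


def clean_prices(project_root):
    """Copy data/raw/prices.csv to data/processed/, dropping rows without a price."""
    project_root = Path(project_root)
    raw_file = project_root / "data" / "raw" / "prices.csv"  # read-only input
    out_file = project_root / "data" / "processed" / "prices.csv"
    out_file.parent.mkdir(parents=True, exist_ok=True)

    with raw_file.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        # Keep only rows where the (assumed) price column is non-empty
        rows = [row for row in reader if row.get("price")]

    with out_file.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_file
```

For a script stored at `scripts/cleaning/clean_prices.py`, `Path(__file__).resolve().parents[2]` gives the project root, so the same code runs on any machine without an absolute path.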

Resources

Quarto

Git

Summary

| Practice | Tool |
|---|---|
| Directory structure | Cookiecutter |
| Version control | Git + GitHub |
| Package dependencies | uv/pip/renv |
| Literate programming | Quarto |

Questions?

Thank you!

Contact: